{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": [],
      "gpuType": "T4"
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    },
    "accelerator": "GPU"
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "# Accelerating inference for GPT-Neo with DeepSpeed\n",
        "This notebook is a companion of chapter 3 of the \"Domain Specific LLMs in Action\" book, author Guglielmo Iozzia, [Manning Publications](https://www.manning.com/), 2024.  \n",
        "The code in this notebook is to introduce readers to the [DeepSpeed](https://github.com/microsoft/DeepSpeed) library to accelerate inference for the [GPT-Neo model](https://github.com/EleutherAI/gpt-neo) for text generation tasks. It can be executed in the Colab free tier with hardware acceleration (GPU).  \n",
        "More details about the code can be found in the book's chapter."
      ],
      "metadata": {
        "id": "x6wXI87UM6Ms"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Install the missing dependencies in the Colab VM (DeepSpeed and HF's Accelerate only)."
      ],
      "metadata": {
        "id": "YDRRbZPpOLM9"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Txg6MF3QdT8D"
      },
      "outputs": [],
      "source": [
        "!pip install deepspeed accelerate"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Before loading the model, let's define a custom function to be used for benchmarking (latency measurement)."
      ],
      "metadata": {
        "id": "PwgbZqd6ijqN"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from time import perf_counter\n",
        "import numpy as np\n",
        "\n",
        "def measure_latency(model, tokenizer, payload, device, generation_args={}):\n",
        "    input_ids = tokenizer(payload, return_tensors=\"pt\").input_ids.to(device)\n",
        "    latencies = []\n",
        "    # Do GPU warm up before benchmarking\n",
        "    for _ in range(2):\n",
        "        _ =  model.generate(input_ids, **generation_args)\n",
        "    # Runs used for measuring the latency\n",
        "    for _ in range(20):\n",
        "        start_time = perf_counter()\n",
        "        _ = model.generate(input_ids, **generation_args)\n",
        "        latency = perf_counter() - start_time\n",
        "        latencies.append(latency)\n",
        "\n",
        "    time_avg_ms = 1000 * np.mean(latencies)\n",
        "    time_std_ms = 1000 * np.std(latencies)\n",
        "    time_p95_ms = 1000 * np.percentile(latencies,95)\n",
        "\n",
        "    return f\"P95 latency (ms) - {time_p95_ms}; Average latency (ms) - {time_avg_ms:.2f} +\\- {time_std_ms:.2f};\", time_p95_ms\n"
      ],
      "metadata": {
        "id": "qWNFW20Dg8mX"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Download the base GPT-Neo 2.7B model in half precision and the related tokenizer from the HF's Hub."
      ],
      "metadata": {
        "id": "AnTa3jQEdtYS"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import torch\n",
        "from transformers import GPTNeoForCausalLM, GPT2Tokenizer, AutoTokenizer, AutoModelForCausalLM\n",
        "\n",
        "model_id = \"EleutherAI/gpt-neo-2.7B\"\n",
        "\n",
        "tokenizer = AutoTokenizer.from_pretrained(model_id)\n",
        "model = AutoModelForCausalLM.from_pretrained(model_id,\n",
        "                                          torch_dtype=torch.float16,\n",
        "                                          device_map=\"auto\")\n",
        "print(f\"model is loaded on device {model.device.type}\")"
      ],
      "metadata": {
        "id": "CpMiuRhtdjUb"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Do inference with the downloaded model to verify that everything is working as expected."
      ],
      "metadata": {
        "id": "MeZNahEFPGNN"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "example = \"The story so far: in the beginning, the universe was created.\"\n",
        "\n",
        "input_ids = tokenizer(example, return_tensors=\"pt\").input_ids.to(model.device)\n",
        "logits = model.generate(input_ids,\n",
        "                        do_sample=True,\n",
        "                        num_beams=1,\n",
        "                        min_length=128,\n",
        "                        max_new_tokens=128,\n",
        "                        pad_token_id=50256)\n",
        "\n",
        "print(f\"prediction: \\n \\n {tokenizer.decode(logits[0].tolist()[len(input_ids[0]):])}\")"
      ],
      "metadata": {
        "id": "K47Okvzld-yh"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Perform benchmark for the vanilla model. The previously defined ```measure_latency``` function is used.\n",
        "\n"
      ],
      "metadata": {
        "id": "h0jVn7MSPi_S"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "generation_args = dict(do_sample=True,\n",
        "                      max_length=300,\n",
        "                      pad_token_id=50256,\n",
        "                      use_cache=True\n",
        ")\n",
        "vanilla_results = measure_latency(model, tokenizer, example,\n",
        "                                  model.device, generation_args)\n",
        "\n",
        "print(f\"Vanilla model: {vanilla_results[0]}\")"
      ],
      "metadata": {
        "id": "BvEk6ZEChcXu"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Let's now optimize the base GPT-Neo 2.7B model for inference on GPU with DeepSpeed. The decision about which of the original model's layers have to be replaced is left to DeepSpeed itself here."
      ],
      "metadata": {
        "id": "8UfJvEb1eTXV"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import os\n",
        "\n",
        "os.environ['MASTER_ADDR'] = 'localhost'\n",
        "os.environ['MASTER_PORT'] = '9999'\n",
        "os.environ['RANK'] = \"0\"\n",
        "os.environ['LOCAL_RANK'] = \"0\"\n",
        "os.environ['WORLD_SIZE'] = \"1\""
      ],
      "metadata": {
        "id": "tpb4h2IsUNxw"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "import deepspeed\n",
        "\n",
        "ds_model = deepspeed.init_inference(\n",
        "    model=model,\n",
        "    mp_size=1,\n",
        "    dtype=torch.float16,\n",
        "    replace_method=\"auto\",\n",
        "    replace_with_kernel_inject=True,\n",
        ")\n",
        "print(f\"model is loaded on device {ds_model.module.device}\")"
      ],
      "metadata": {
        "id": "hfXLPFHtecMz"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Print the optimized model's architecture to this cell output to verify that some of the original model's layers have been replaced with DeepSpeed optimized kernel implementations."
      ],
      "metadata": {
        "id": "cTFt748TRmMB"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "ds_model"
      ],
      "metadata": {
        "id": "XhYIvUXOg2E5"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Do inference with the optimized model to verify that everything is working as expected."
      ],
      "metadata": {
        "id": "RWBvqWiLRfGr"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "input_ids = tokenizer(example, return_tensors=\"pt\").input_ids.to(model.device)\n",
        "logits = ds_model.generate(input_ids,\n",
        "                           do_sample=True,\n",
        "                           num_beams=1,\n",
        "                           min_length=128,\n",
        "                           max_new_tokens=128,\n",
        "                           pad_token_id=50256,\n",
        "                           use_cache=False\n",
        "                          )\n",
        "print(tokenizer.decode(logits[0].tolist()))"
      ],
      "metadata": {
        "id": "qhDqFBEken8u"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Perform now benchmark for the DeepSpeed optimized model. The previously defined ```measure_latency``` function is used."
      ],
      "metadata": {
        "id": "1pY3LIeuSMx-"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "generation_args = dict(do_sample=True,\n",
        "                       max_length=300,\n",
        "                       pad_token_id=50256,\n",
        "                       use_cache=True)\n",
        "ds_results = measure_latency(ds_model, tokenizer, example,\n",
        "                             ds_model.module.device, generation_args)\n",
        "\n",
        "print(f\"DeepSpeed model: {ds_results[0]}\")"
      ],
      "metadata": {
        "id": "kRGcsLTDh-bT"
      },
      "execution_count": null,
      "outputs": []
    }
  ]
}